Abstract — To harness the ever growing capacity and decreasing cost of storage, providing an abstraction of dependable storage in 
the presence of crash-stop and Byzantine failures is compulsory. We propose a decentralized Reed Solomon coding mechanism with 
minimum communication overhead. Using a progressive data retrieval scheme, a data collector contacts only the necessary number 
of storage nodes needed to guarantee data integrity. The scheme gracefully adapts the cost of successful data retrieval to the number 
of storage node failures. Moreover, by leveraging the Welch-Berlekamp algorithm, it avoids unnecessary computations. Compared to 
the state-of-the-art decoding scheme, the implementation and evaluation results show that our progressive data retrieval scheme has 
up to 35 times better computation performance for low Byzantine node rates. Additionally, the communication cost in data retrieval is 
derived analytically and corroborated by Monte-Carlo simulation results. Our implementation is flexible in that the level of redundancy 
it provides is independent of the number of data generating nodes, a requirement for distributed storage systems. 

Index Terms — Reliability, Availability, Fault tolerance, Error control codes 
1 Introduction 

The cost of storage for data availability over networks has 
decreased drastically over the years. Companies such as 
Google and Amazon offer terabytes of online storage for free or at 
a very low cost. Also, low-power storage media are widely 
used in embedded devices and mobile computers. However, 
to harness the ever-growing capacity and decreasing cost of 
distributed storage for persistent data availability, a number of 
challenges need to be addressed: (i) volatility of storage due to 
network disconnectivity, varying administrative restrictions or 
user preferences, and nodal mobility (of mobile devices); (ii) 
(partial) failures of storage devices (for example, flash media 
are known to be engineered to trade off error probabilities for 
cost reduction); and (iii) software bugs or malicious attacks, where 
an adversary manages to compromise enough storage nodes 
that data integrity can no longer be guaranteed. 

To ensure availability and data integrity despite failure 
or compromise of storage nodes, survivable storage systems 
spread data redundantly across a set of distributed storage 
nodes. At the core of a survivable storage system is a coding 
scheme that maps information bits to stored bits, and vice 
versa. The units of such a mapping are referred to as symbols in 
this paper. An (n, k) code is defined by the following two 
primitives: 

- encode, $c = \mathrm{encode}(u, n, k)$, takes as input k information 
symbols $u = [u_0, u_1, \ldots, u_{k-1}]$ and returns a coded 
vector $c = [c_0, c_1, \ldots, c_{n-1}]$. The coded symbols are 
stored on storage nodes, one per node. 
- decode, $u = \mathrm{decode}(r, n, k)$, accesses a subset of storage nodes 
and returns the original k information symbols from 
possibly corrupted symbols. 

Most existing approaches to survivable and dependable 
storage assume crash-stop behavior. That is, a storage device 
becomes unavailable once it has failed (also called an "erasure"). Solutions 
such as the various RAID configurations [1] and their extensions 
are engineered for high read and write data throughput. In 
this case, low-complexity (replication or XOR-based) 
coding mechanisms are typically employed to recover the original data 
from a limited degree of erasure. We argue that Byzantine 
failures, where devices fail in an arbitrary manner and cannot be 
trusted, are becoming more pertinent with the prevalence of 
cheap storage devices, software bugs and malicious attacks. 
Efficient encode and decode primitives that can detect data 
corruption and handle Byzantine failures serve as a fundamental 
building block to support higher-level abstractions such 
as multi-reader multi-writer atomic registers [?] and digital 
fingerprints [?] in dependable distributed systems. 

For fixed error correction capability, the efficiency of encode 
and decode primitives can be evaluated by three metrics, i) 
storage overhead measured as the ratio between the number 
of storage symbols and total information symbols (n/k); ii) 
encoding and decoding computation time; and iii) communi- 
cation overhead measured in the number of bits transferred in 
the network for encode and decode. Communication overhead 
is of much importance in wide-area and/or low-bandwidth 
storage systems. Even though Reed-Solomon (RS) codes have 
been used for distributed storage within a single system, they 
have been found unsuitable for distributed networked storage 
due to their centralized nature [?] and high communication 
overhead [?]. However, by encoding data at each data node, 
we found that RS codes can avoid the above disadvantages 
and provide better performance in almost every aspect than 
existing storage schemes. Hence, in this paper, we propose a 
novel solution to spreading redundant information efficiently 
across distributed storage nodes using incremental RS decoding. 
By virtue of RS codes, our scheme is storage optimal. The 
key novelty of the proposed approach lies in a progressive 
data retrieval procedure, which retrieves just enough data 
from live storage nodes, and performs decoding incrementally. 
As a result, both communication and computation costs are 
minimized, and adapt to the degree of errors in the system. We 
provide a theoretical characterization of the communication 
cost and success rate of data retrieval using the proposed 
scheme in the presence of arbitrary errors in the system. Our 
implementation studies demonstrate up to 20 times speed-up 
of the progressive data retrieval scheme in computation time, 
relative to a classical scheme. Moreover, the performance of the proposed scheme 
is comparable to that of a genie-aided decoding process, which 
assumes knowledge of the failure modes of storage nodes. 
In this paper, we make the following contributions: 

- Design of a novel progressive data retrieval algorithm 
that is storage and communication optimal, and computationally 
efficient. It handles Byzantine failures in storage 
nodes gracefully as the probability of failures increases. 
- Development of an analytical model to evaluate the 
communication cost of our data retrieval algorithm. 
- Elimination of the need for the number of data nodes to 
equal the number of information symbols, k, a constraint 
that is restrictive and unrealistic for general distributed 
storage systems. 

The rest of the paper is organized as follows. Related work 
is given in Section 2. The progressive data retrieval scheme is 
presented in Section 3, with the details of the incremental RS 
decoding algorithm in Section 4. An analysis of our coding, 
communication and success rate complexity is provided in 
Section 5, and Section 6 compares our scheme with leading 
erasure coding protocols. Evaluation results are presented in 
Section 7. Finally, we conclude the paper in Section 8. 

2 Background and Related Work 

In storage systems, ensuring reliability and data integrity 
requires the introduction of redundancy. A file is divided into k 
symbols, encoded into n coded symbols and stored at n nodes. 
One important metric of coding efficiency is the redundancy-reliability 
trade-off, defined as n/k. The simplest form of 
redundancy is replication. As a generalization of replication, 
erasure coding offers better storage efficiency. Maximum 
Distance Separable (MDS) codes are optimal in that they provide 
the largest separation among codewords, and an (n, k)-MDS 
code can recover from any v errors if $v \le \lfloor (n - k - s)/2 \rfloor$, 
where s is the number of erasures (or irretrievable symbols). 
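The trade-off between errors and erasures can be made concrete with a short calculation. The sketch below assumes the standard MDS condition $2v + s \le n - k$; the function name is illustrative and not part of the paper.

```python
def max_correctable_errors(n: int, k: int, erasures: int) -> int:
    """Largest v satisfying 2*v + erasures <= n - k for an (n, k) MDS code."""
    if not 0 <= erasures <= n - k:
        raise ValueError("the number of erasures must lie in [0, n - k]")
    return (n - k - erasures) // 2


# Each symbol left unread (an erasure) costs half an error of correction capability.
for s in (0, 100, 622):
    print(s, max_correctable_errors(1023, 401, s))
```

Treating unread nodes as erasures is what lets a collector start with few symbols and add more only when errors are detected.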

2.1 Reed-Solomon codes 

RS codes are the most well-known class of MDS codes. They 
not only can recover data when nodes fail, but can guarantee 
recovery when a certain subset of nodes are Byzantine. RS 
codes operate on symbols of m bits. An (n, k) RS code is 
a linear code, with each symbol in $GF(2^m)$, and parameters 
$n = 2^m - 1$ and $n - k = 2t$, where n is the total number of 
symbols in a codeword, k is the total number of information 
symbols, and t is the symbol-error-correcting capability of the 
code. 

Encoding: Let the sequence of k information symbols in 
$GF(2^m)$ be $u = [u_0, u_1, \ldots, u_{k-1}]$ and u(x) be the information 
polynomial of u represented as 

$$u(x) = u_0 + u_1 x + \cdots + u_{k-1} x^{k-1}.$$

The codeword polynomial, c(x), corresponding to u(x) can 
be encoded as 

$$c(x) = u(x) g(x),$$

where g(x) is a generator polynomial of the RS code. It is 
well-known that g(x) can be obtained as 

$$g(x) = (x - \alpha^{b})(x - \alpha^{b+1}) \cdots (x - \alpha^{b+2t-1}) = g_0 + g_1 x + g_2 x^2 + \cdots + g_{2t} x^{2t}, \tag{1}$$

where $\alpha$ is a primitive element in $GF(2^m)$, b is an arbitrary 
integer, and $g_i \in GF(2^m)$. 

Decoding: The decoding process of RS codes is more complex. 
A complete description of the decoding of RS codes can be 
found in [6]. 

Let r(x) be the received polynomial and $r(x) = c(x) + e(x) + \gamma(x) = c(x) + \lambda(x)$, 
where $e(x) = \sum_{j=0}^{n-1} e_j x^j$ is the 
error polynomial, $\gamma(x) = \sum_{j=0}^{n-1} \gamma_j x^j$ the erasure polynomial, 
and $\lambda(x) = \sum_{j=0}^{n-1} \lambda_j x^j = e(x) + \gamma(x)$ the errata polynomial. 
Note that g(x) and (hence) c(x) have $\alpha^{b}, \alpha^{b+1}, \ldots, \alpha^{b+2t-1}$ 
as roots. This property is used to determine the error locations 
and recover the information symbols. 

The basic procedure of RS decoding is shown in Figure 1. 
The last step of the decoding procedure involves solving a 
linear set of equations, and can be made efficient by the use 
of Vandermonde generator matrices [?]. 

In $GF(2^m)$, addition is equivalent to bit-wise exclusive-or 
(XOR), and multiplication is typically implemented with 
multiplication tables or discrete logarithm tables. To reduce the 
complexity of multiplication, Cauchy Reed-Solomon (CRS) 
codes [8] have been proposed; they use a different construction 
of the generator matrix and convert multiplications to XOR 
operations for erasure decoding. However, CRS codes incur the same 
complexity as RS codes for error correction. 
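The table-based arithmetic described above can be sketched as follows. This is a minimal illustration for $GF(2^8)$; the primitive polynomial 0x11D and the choice $\alpha = 2$ are assumptions made for the example, not parameters taken from the paper.

```python
# GF(2^8) arithmetic via discrete-log tables (illustrative sketch).
PRIM_POLY = 0x11D  # assumed primitive polynomial x^8 + x^4 + x^3 + x^2 + 1
ORDER = 255        # size of the multiplicative group, 2^8 - 1

EXP = [0] * (2 * ORDER)  # EXP[i] = alpha^i; doubled so a sum of two logs needs no modulo
LOG = [0] * (ORDER + 1)  # LOG[x] = i with alpha^i = x (LOG[0] is unused)

value = 1
for i in range(ORDER):
    EXP[i] = value
    LOG[value] = i
    value <<= 1              # multiply by alpha = 2
    if value & 0x100:        # reduce modulo the primitive polynomial
        value ^= PRIM_POLY
for i in range(ORDER, 2 * ORDER):
    EXP[i] = EXP[i - ORDER]


def gf_add(a: int, b: int) -> int:
    """Addition (and subtraction) in GF(2^m) is bit-wise XOR."""
    return a ^ b


def gf_mul(a: int, b: int) -> int:
    """Multiplication via log/antilog lookups; zero has no logarithm."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]
```

With the tables in place, a multiplication costs two lookups and one integer addition, which is why software RS implementations usually prefer this to bit-level polynomial multiplication.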

2.2 Existing work 

Several XOR-based erasure codes (over GF(2)) [?], [9]-[11] 
have been used in storage systems. In RAID-6 systems, 
each disk is partitioned into strips of fixed size. Two parity 
strips are computed using one strip from each data disk, 
forming a stripe together with the data strips. EVEN-ODD 
[10], Row Diagonal Parity (RDP) [9], and Minimal Density 
RAID-6 codes [11] use XOR operations, and are specific to 
RAID-6. A detailed comparison of the encoding and decoding 
performance of several open-source erasure coding libraries 
for storage is provided in [12]. We note that the gain in computation 
efficiency of XOR-based erasure codes is achieved 
by trading off fault tolerance. Our progressive data retrieval 
algorithm, however, can tolerate as many faults as the 
configured robustness of the code allows, and 
is highly efficient in both computation and communication 
costs. Moreover, RAID-6 systems can recover from the loss 
of up to two disks but cannot handle Byzantine failures, 
which rules out the application of such systems to sensor 
networks. 

Fig. 1. Block diagram of RS decoding. Above each block, the corresponding existing algorithms are indicated. 

In the context of network storage for wireless sensor networks, 
randomized linear codes [4] and fountain codes [5] 
have been applied with the objective that a data collector can 
retrieve unit data from each of k data sources by accessing 
any k out of n storage nodes, and thus up to n − k crash-stop 
node failures can be tolerated. However, such schemes cannot 
recover from data modifications in the field. Compared to 
erasure-based solutions, the key distinctions are i) coding is 
done at the storage nodes rather than at the data source, and 
ii) each storage node only has unit capacity. Later, we provide 
a reference implementation of a single data collector problem 
using the proposed primitives. Our evaluation studies show 
that our implementation outperforms the distributed storage 
scheme based on random linear network coding in almost all 
metrics. 

3 Progressive Data Retrieval 

We use the abstractions of a data node, which is a source of 
information that must be stored, and a storage node, which corresponds 
to a storage device. Nodes are subject to both crash-stop 
failures, where data cannot be accessed, and Byzantine 
failures, where arbitrary data may be returned. The communication 
cost of transferring one unit of data from the data source 
to a storage node is assumed to be constant, independent of 
the location of the storage node. Unlike existing decentralized 
schemes for distributed networked storage, the message length 
in each encoding process of the proposed scheme is not tied 
to the number of data nodes. Hence, the RS code used 
in the proposed scheme is denoted as an (n, k) code, where k 
need not equal the number of data nodes. The 
scheme given in [13] is the special case in which k equals the 
number of data nodes. It will be 
shown that the value of k affects the storage efficiency and 
the communication cost. 

3.1 Data storage 

The data storage scheme consists of two steps. First, for data 
integrity, a message authentication code (MAC) is added to 
each data block generated by a data node before it is encoded. 
One-way hash functions such as MD5, SHA-1 and SHA-2 can be 
used. For simplicity, we adopt a CRC code for error detection 
with r redundant bits [6], [13]. The fraction of errors that 
cannot be detected by a CRC code depends only on 
its number of redundant bits. That is, a CRC code with r 
redundant bits fails to detect a fraction $(1/2)^r$ (that is, $(1/2)^r \times 100\%$) of errors. If 
$T_0$ is the size of the original data with header information, 
then the size of the resulting data with CRC is $T = T_0 + r$. 
It is easy to see that the CRC overhead can be amortized by 
using large data blocks. 
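The integrity step can be illustrated with a standard 32-bit CRC. The sketch below uses Python's `zlib.crc32` as a stand-in; the paper does not state which CRC polynomial or width r it uses, so r = 32 and the helper names are assumptions.

```python
import zlib

CRC_BYTES = 4  # r = 32 redundant bits (assumed for this example)


def add_crc(block: bytes) -> bytes:
    """Append the CRC so that T = T_0 + r bits are handed to the encoder."""
    return block + zlib.crc32(block).to_bytes(CRC_BYTES, "big")


def check_and_strip_crc(stored: bytes):
    """Return the original block if the CRC matches, otherwise None."""
    if len(stored) < CRC_BYTES:
        return None
    payload, tag = stored[:-CRC_BYTES], stored[-CRC_BYTES:]
    if zlib.crc32(payload).to_bytes(CRC_BYTES, "big") != tag:
        return None
    return payload


def undetected_fraction(r: int) -> float:
    """Fraction of error patterns an r-bit CRC cannot detect, (1/2)^r."""
    return 0.5 ** r
```

For r = 32 the undetected fraction is about 2.3e-10, and the 4-byte overhead is negligible for blocks of kilobytes or more.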

In the second step, a data block is partitioned into information 
symbols of length m bits and RS codes are applied. 
The data node divides its data into $\lceil T/m \rceil$ symbols such 
that each symbol represents an element in $GF(2^m)$. Next the 
$\lceil T/m \rceil$ symbols are divided into $\lceil \lceil T/m \rceil / k \rceil$ information groups, 
each of k symbols (see footnote 1). Let the k symbols of the i-th group be the 
components of the information vector $u_i = [u_{i0}, u_{i1}, \ldots, u_{i(k-1)}]$, 
where $1 \le i \le \lceil \lceil T/m \rceil / k \rceil$. The node encodes $u_i$ into 
$c_i = [c_{i0}, c_{i1}, \ldots, c_{i(n-1)}]$ with n symbols as 

$$c_i = u_i G,$$

where 

$$G = \begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & \alpha & \alpha^{2} & \cdots & \alpha^{n-1} \\
1 & \alpha^{2} & (\alpha^{2})^{2} & \cdots & (\alpha^{n-1})^{2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \alpha^{k-1} & (\alpha^{2})^{k-1} & \cdots & (\alpha^{n-1})^{k-1}
\end{bmatrix}. \tag{2}$$

Recall that $\alpha$ is a primitive element (generator) of $GF(2^m)$, 
which can be determined in advance. The data node then packs 
all $c_{ij}$, $1 \le i \le \lceil \lceil T/m \rceil / k \rceil$, and sends them with their index j 
to storage node (j + 1) via the network. 
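The encoding step amounts to evaluating the information polynomial at successive powers of $\alpha$, so that coded symbol j is $c_j = \sum_r u_r (\alpha^j)^r$. The following sketch illustrates this over the small field $GF(2^4)$ (n = 15); the field size, the primitive polynomial $x^4 + x + 1$ and all function names are assumptions chosen to keep the example short.

```python
# Vandermonde-style RS encoding over GF(2^4), n = 15 (illustrative sketch).
M = 4
N_MAX = (1 << M) - 1  # n = 2^m - 1 = 15
PRIM_POLY = 0b10011   # assumed primitive polynomial x^4 + x + 1, alpha = 2

EXP = []              # EXP[i] = alpha^i
value = 1
for _ in range(N_MAX):
    EXP.append(value)
    value <<= 1
    if value & (1 << M):
        value ^= PRIM_POLY
LOG = {v: i for i, v in enumerate(EXP)}


def gf_mul(a: int, b: int) -> int:
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % N_MAX]


def alpha_pow(e: int) -> int:
    return EXP[e % N_MAX]


def encode_group(u, n=N_MAX):
    """Compute c = u G, where column j of G is [1, a^j, (a^j)^2, ..., (a^j)^(k-1)]."""
    coded = []
    for j in range(n):
        acc = 0
        for r, symbol in enumerate(u):
            acc ^= gf_mul(symbol, alpha_pow(j * r))  # XOR is addition in GF(2^m)
        coded.append(acc)
    return coded  # coded[j] would be sent to storage node j + 1


codeword = encode_group([1, 2, 3, 4, 5])  # k = 5 information symbols
```

Because every column depends only on its own index j, a storage node's symbol can be produced without reference to any other node, which is what allows each data node to encode independently.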
3.2 Data retrieval 

To reconstruct the source data, a collector needs to access 
a sufficient number of storage nodes to ensure data integrity. 
Among n storage nodes, let the number of erasures, which 
includes the number of crash-stop nodes and the number of 
nodes that have not been accessed, be s. The identity of crash-stop 
nodes can be determined by the use of keep-alive messages. 
Additionally, there are v nodes with Byzantine failures. Neither 
v nor the identity of these nodes is known to the data 
collector. 

G given in (2) is a generator matrix of an RS code [6], and 
thus an error-erasure decoding algorithm can recover all data 
if there is no error in at least k encoded symbols. Without 
loss of generality, we assume that the data collector retrieves 
encoded symbols from storage nodes $j_0, j_1, \ldots, j_{k-1}$. If no 
error is present, the k symbols in the i-th group of any data 
node can be recovered by solving the following system of 
linear equations: 

$$[u_{i0}, u_{i1}, \ldots, u_{i(k-1)}] \hat{G} = [c_{ij_0}, c_{ij_1}, \ldots, c_{ij_{k-1}}], \tag{3}$$

where 

$$\hat{G} = \begin{bmatrix}
1 & 1 & \cdots & 1 \\
\alpha^{j_0} & \alpha^{j_1} & \cdots & \alpha^{j_{k-1}} \\
(\alpha^{j_0})^{2} & (\alpha^{j_1})^{2} & \cdots & (\alpha^{j_{k-1}})^{2} \\
\vdots & \vdots & & \vdots \\
(\alpha^{j_0})^{k-1} & (\alpha^{j_1})^{k-1} & \cdots & (\alpha^{j_{k-1}})^{k-1}
\end{bmatrix}.$$

$\hat{G}$ can be constructed from the primitive element and the index 
associated with $c_{ij_d}$, $0 \le d \le k - 1$. 

1. CRC codes are added to every information group. Hence, the data size 
T includes those bits added by CRC codes. The last information group may 
have fewer than k symbols. In this case, zeros are appended during the encoding 
procedure. 

When the number of erroneous (or compromised) nodes is 
unknown but bounded, the proposed progressive procedure 
for data retrieval minimizes communication cost without any 
prior knowledge regarding failure models of nodes. 

From Section 2 we know that RS codes can recover from 
any v errors if $v \le \lfloor (n - k - s)/2 \rfloor$. Therefore, if the number of 
compromised nodes (v) is small, more erasures (s) can be 
tolerated, and fewer nodes need to be accessed (by treating 
the remaining nodes as unavailable). The data retrieval procedure proceeds 
in stages. At stage $\ell$, $\ell$ errors are assumed. If RS decoding 
fails or the decoded information symbols fail the CRC check, 
there must exist more erroneous nodes than RS error-erasure 
decoding can handle at this stage. In order to correct one more 
error, two more symbols need to be collected, since the number 
of erasures allowed is reduced by two. Therefore, the total 
number of symbols retrieved at stage $\ell$ is $k + 2\ell$. 

This procedure is clearly optimal in communication cost, 
as additional symbols are retrieved only when necessary. 
However, if applied naively, its computation cost can be 
quite high since RS decoding must be performed at each 
stage. For example, when n = 1023 and k = 401, with a 1% 
error probability (defined as the probability that a storage node 
is faulty), our analytical results from Section 5 show that, 
on average, 409.2 storage nodes need to be accessed. That 
is, the decoding needs to be done $\lceil (409.2 - 401)/2 \rceil = 5$ times. 
On the other hand, consider a naive scheme that retrieves 
coded symbols from each of the n storage nodes and decodes 
only once. The naive scheme may incur less computation, 
but suffers from a high communication cost. Such trade-offs 
between computation and communication are avoidable, as 
we show in Section 4, where we devise an algorithm that 
can utilize intermediate computation results from previous 
stages and perform RS decoding incrementally. Combined 
with the incremental decoding of stored symbols, the proposed 
progressive data retrieval scheme (detailed in Algorithm 1) is 
both computation and communication efficient. For simplicity, 
Algorithm 1 is presented only for one group of encoded 
symbols. It is applied to all groups of encoded symbols to 
retrieve all the original data. 
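The stage arithmetic used in the example can be checked directly. The helper names below are illustrative; the figure of 409.2 average accesses is the one quoted in the text.

```python
import math


def symbols_at_stage(k: int, stage: int) -> int:
    """Total symbols retrieved once `stage` errors are assumed: k + 2 * stage."""
    return k + 2 * stage


def decoding_attempts(k: int, average_accessed: float) -> int:
    """Number of decoding attempts quoted in the text: ceil((accessed - k) / 2)."""
    return math.ceil((average_accessed - k) / 2)


print(decoding_attempts(401, 409.2))  # the n = 1023, k = 401 example from the text
print(symbols_at_stage(401, 5))
```

A scheme that repeats full RS decoding at every one of these attempts pays the whole decoding cost each time, which is the motivation for reusing intermediate results.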

In Algorithm 1, for each i (or, equivalently, stage $\ell = (i - k)/2$, 
where the number of errors $v \ge \ell$), the decoding 
process declares errors in one of two cases. First, the 
proposed incremental RS decoding algorithm (IRD) may 
fail to produce decoded symbols. Otherwise, the 
decoded symbols fail the CRC check. Our implementation 
(Section 7) shows that the former happens frequently. Thus, in 
most cases, CRC checking is carried out only once throughout 
the entire decoding process. 

Algorithm 1: Progressive Data Retrieval 
begin 
    $i \leftarrow k$; 
    The data collector randomly chooses k storage nodes 
    and retrieves encoded data, $c_i = [c_{j_0}, c_{j_1}, \ldots, c_{j_{k-1}}]$; 
    $r_i \leftarrow c_i$; 
    repeat 
        $u = r_i \hat{G}^{-1}$; 
        if CRCTest(u) = SUCCESS then 
            Delete CRC checksum from u to obtain $u_0$; 
            return $u_0$; 
        else 
            repeat 
                $i \leftarrow i + 2$; 
                Two more encoded data from remaining 
                nodes, $c_{i_1}, c_{i_2}$, are retrieved; 
                $c_i \leftarrow c_{i-2} \cup \{c_{i_1}, c_{i_2}\}$; 
            until $(r_i = \mathrm{IRD}(c_i)) = \mathrm{SUCCESS}$ || $i \ge n - 1$; 
    until $i \ge n - 1$; 
    return FAIL; 
end 
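The control flow of the progressive retrieval procedure can be sketched independently of the decoder. In the sketch below, `fetch`, `decode` and `crc_ok` are hypothetical callables standing in for the network access, the error-erasure decoder and the CRC test; it is a simplified rendering of the stage logic (k symbols first, then two more per stage), not the authors' implementation.

```python
def progressive_retrieve(n, k, fetch, decode, crc_ok, order=None):
    """Return (data, symbols_used); data is None if retrieval fails.

    fetch(node)            -> symbol, or None for a crash-stop node (assumed API)
    decode(symbols, stage) -> candidate data assuming `stage` errors, or None
    crc_ok(data)           -> True if the candidate passes the integrity check
    """
    order = list(range(n)) if order is None else list(order)
    symbols = {}  # node index -> retrieved symbol
    cursor = 0

    def pull(count):
        """Retrieve `count` more symbols, skipping crashed nodes."""
        nonlocal cursor
        got = 0
        while got < count and cursor < len(order):
            node = order[cursor]
            cursor += 1
            value = fetch(node)
            if value is not None:
                symbols[node] = value
                got += 1
        return got == count

    if not pull(k):
        return None, len(symbols)
    stage = 0  # number of errors currently assumed
    while True:
        candidate = decode(dict(symbols), stage)
        if candidate is not None and crc_ok(candidate):
            return candidate, len(symbols)
        if not pull(2):  # one more correctable error costs two more symbols
            return None, len(symbols)
        stage += 1
```

A quick way to exercise the loop is an (n, 1) repetition code, which is MDS: with 1 + 2ℓ copies and at most ℓ corrupted ones, the majority value is the correct one.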

4 Progressive Decoding 

In this section, we present the incremental RS decoding 
algorithm. Compared to the classic RS decoding, it utilizes 
intermediate computation results and decodes incrementally 
as more symbols become available. 

4.1 The basic algorithm 

Given the received coded symbols $[r_0, r_1, \ldots, r_{n-1}]$ with erasures 
set to zero, the generalized syndrome polynomial 
S(x) can be calculated [13] as follows: 

$$S(x) = \sum_{j=0}^{n-1} r_j \alpha^{j} \, \frac{T(x) - T(\alpha^{j})}{x - \alpha^{j}}, \tag{4}$$

where T(x) is an arbitrary polynomial with degree (n − k). 
Assume that v errors occur in unknown locations $j_1, j_2, \ldots, j_v$ 
and s erasures in known locations $m_1, m_2, \ldots, m_s$ of the 
received polynomial. Then 

$$e(x) = e_{j_1} x^{j_1} + e_{j_2} x^{j_2} + \cdots + e_{j_v} x^{j_v}$$

and 

$$\gamma(x) = \gamma_{m_1} x^{m_1} + \gamma_{m_2} x^{m_2} + \cdots + \gamma_{m_s} x^{m_s},$$

where $e_{j_\ell}$ is the value of the $\ell$-th error, $\ell = 1, \ldots, v$, and 
$\gamma_{m_\ell}$ is the value of the $\ell$-th erasure, $\ell = 1, \ldots, s$. Since the 
received values in the erased positions are zero, $\gamma_{m_\ell} = -c_{m_\ell}$ 
for $\ell = 1, \ldots, s$. The decoding process is to find all $j_\ell$, $e_{j_\ell}$, 
$m_\ell$, and $\gamma_{m_\ell}$. Let $E = \{j_1, \ldots, j_v\}$, $M = \{m_1, \ldots, m_s\}$, 
and $D = E \cup M$. Clearly, $E \cap M = \emptyset$. It has been shown that 
a key equation for decoding is 



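$$\Lambda(x) S(x) = \Psi(x) T(x) + \Omega(x), \tag{5}$$

where 

$$\Lambda(x) = \prod_{j \in D} (x - \alpha^{j}) = \prod_{j \in E} (x - \alpha^{j}) \prod_{j \in M} (x - \alpha^{j}) = \Lambda_E(x) \Lambda_M(x), \tag{6}$$

$$\Psi(x) = \sum_{j \in D} \lambda_j \alpha^{j} \prod_{i \in D,\, i \ne j} (x - \alpha^{i}), \tag{7}$$

$$\Omega(x) = -\sum_{j \in D} \lambda_j \alpha^{j} T(\alpha^{j}) \prod_{i \in D,\, i \ne j} (x - \alpha^{i}). \tag{8}$$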



If $2v + s < n - k + 1$, then (5) has a unique solution 
$\{\Lambda(x), \Psi(x), \Omega(x)\}$. Instead of solving (5) by either the 
Euclidean or the Berlekamp-Massey algorithm, we introduce a 
reduced key equation [15] that can be solved by the Welch-Berlekamp 
(W-B) algorithm [6]. It will be demonstrated that, 
by using the W-B algorithm and the reduced key equation, the 
complexity of decoding can be reduced drastically. Let 
$\mathcal{T} = \{ j \mid T(\alpha^{j}) = 0 \}$. Let a set of coordinates $U \subseteq \{0, 1, \ldots, n-1\}$ 
be defined by $U = M \cap \mathcal{T}$. A polynomial $\Lambda_U(x)$ is then 
defined by $\Lambda_U(x) = \prod_{j \in U} (x - \alpha^{j})$, which is known to 
the receiver since T(x) and M are both known. Since $\Lambda_U(x)$ 
divides both $\Lambda(x)$ and T(x), according to (5), it also divides 
$\Omega(x)$. Hence, we have the following reduced key equation: 

$$\hat{\Lambda}(x) S(x) = \Psi(x) \hat{T}(x) + \hat{\Omega}(x), \tag{9}$$

where 

$$\Lambda(x) = \hat{\Lambda}(x) \Lambda_U(x), \qquad T(x) = \hat{T}(x) \Lambda_U(x), \qquad \Omega(x) = \hat{\Omega}(x) \Lambda_U(x).$$

Note that $\hat{\Lambda}(x)$ is still a multiple of the error location polynomial 
$\Lambda_E(x)$. The reduced key equation has a unique 
solution if 

$$\deg(\hat{\Omega}(x)) < \deg(\hat{\Lambda}(x)) < \frac{n - k + 1 - |U|}{2}, \tag{10}$$

where $\deg(\cdot)$ is the degree of a polynomial and |U| is the 
number of elements in set U. 

For all $j \in \mathcal{T} \setminus U$, by (9), we have 

$$\hat{\Lambda}(\alpha^{j}) S(\alpha^{j}) = \hat{\Omega}(\alpha^{j}) \tag{11}$$

since $\hat{T}(\alpha^{j}) = 0$. Note that $\alpha^{j}$ is a sampling point and $S(\alpha^{j})$ 
the sampled value for (11). The unique solution $\{\hat{\Lambda}(x), \hat{\Omega}(x)\}$ 
can then be found by the W-B algorithm with time complexity 
$O((n - k - |U|)^2)$ [6]. Once all coefficients of the errata 
polynomial are found, the error locations $j_\ell$ can be determined 
by successive substitution through Chien search [16]. When 
the solution of (9) is obtained, the errata values can be 
calculated. Since there is no need to recover the errata values 
in our application, we omit the calculations. In summary, there 
are three steps in the decoding of RS codes that must be 
implemented. First, the sampled values $S(\alpha^{j})$ for $j \in \mathcal{T} \setminus U$ 
must be calculated. Second, the W-B algorithm is performed 
based on the pairs $(\alpha^{j}, S(\alpha^{j}))$ in order to obtain a valid $\hat{\Lambda}(x)$. 
Third, if a valid $\hat{\Lambda}(x)$ is obtained, the error locations are found by 
Chien search; otherwise, decoding failure is reported. 
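The last of the three steps, locating errors by substituting successive powers of $\alpha$ into the locator polynomial, can be sketched as follows. The example works over $GF(2^4)$ with the assumed primitive polynomial $x^4 + x + 1$; the helper names are illustrative, and the locator is built from known positions only to have something to search.

```python
# Root search for a locator polynomial over GF(2^4) (illustrative sketch).
M = 4
N_MAX = (1 << M) - 1
PRIM_POLY = 0b10011  # assumed primitive polynomial x^4 + x + 1, alpha = 2

EXP = []
value = 1
for _ in range(N_MAX):
    EXP.append(value)
    value <<= 1
    if value & (1 << M):
        value ^= PRIM_POLY
LOG = {v: i for i, v in enumerate(EXP)}


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % N_MAX]


def alpha_pow(e):
    return EXP[e % N_MAX]


def poly_eval(coeffs, x):
    """Horner evaluation; coeffs are ordered from the constant term upwards."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, x) ^ c
    return acc


def locator_from_positions(positions):
    """Build prod_j (x - alpha^j); subtraction equals XOR in characteristic 2."""
    poly = [1]
    for j in positions:
        root = alpha_pow(j)
        nxt = [0] * (len(poly) + 1)
        for d, c in enumerate(poly):
            nxt[d + 1] ^= c             # contribution of c * x
            nxt[d] ^= gf_mul(c, root)   # contribution of c * alpha^j
        poly = nxt
    return poly


def chien_search(locator, n=N_MAX):
    """Return every position j in [0, n) with locator(alpha^j) = 0."""
    return [j for j in range(n) if poly_eval(locator, alpha_pow(j)) == 0]


error_positions = chien_search(locator_from_positions([2, 7]))
```

If the number of roots found differs from the degree of the locator, the polynomial cannot be a valid locator, which is the failure condition the decoder checks for.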

4.2 Incremental computation of S(x), Λ(x), Ω(x) 

Let us choose 

$$T(x) = (x - \alpha^{m_1})(x - \alpha^{m_2}) \cdots (x - \alpha^{m_{n-k}}),$$

where the $m_i$ are the positions of the missing data 
symbols after the data collector has retrieved encoded symbols 
from k storage nodes. In the decoding process, these are erased 
positions before the first iteration of error-erasure decoding. 
Let $U_0 = \{m_1, \ldots, m_{n-k}\}$. The generator polynomial of 
the RS code encoded by (2) has $\alpha^{n-k}, \alpha^{n-k-1}, \ldots, \alpha$ as roots. 
The error-erasure decoding algorithm is mainly based on the W-B 
algorithm, which is an iterative rational interpolation method. 

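In the $\ell$-th iteration, $\ell$ errors are assumed in the data and the 
number of erasures is $n - k - 2\ell$. Let $(j_1^{(\ell)} + 1)$ and $(j_2^{(\ell)} + 1)$ 
be the two storage nodes just accessed in the $\ell$-th iteration. 
Let $U_\ell = U_{\ell-1} \setminus \{j_1^{(\ell)}, j_2^{(\ell)}\}$. The W-B algorithm 
will find $\Lambda^{(\ell)}(x)$ and $\Omega^{(\ell)}(x)$ which satisfy 

$$\Lambda^{(\ell)}(\alpha^{j}) S^{(\ell)}(\alpha^{j}) = \Omega^{(\ell)}(\alpha^{j}) \quad \text{for all } j \in U_0 \setminus U_\ell,$$

where $S^{(\ell)}(x)$ is the generalized syndrome with $r_i = 0$ 
for all $i \in U_\ell$. It has been shown that $\deg(\Lambda^{(\ell)}(x)) > \deg(\Omega^{(\ell)}(x))$ 
for any $\ell$ by a property of the W-B algorithm. Thus, 
if $\deg(\Lambda^{(\ell)}(x)) < (n - k + 1 - |U_\ell|)/2$, then the 
unique solution will exist due to (10). By the definition of the 
generalized syndrome polynomial in (4), for $i \in U_0 \setminus U_\ell$, we 
have 

$$S^{(\ell)}(\alpha^{i}) = \sum_{j=0,\, j \ne i}^{n-1} \frac{r_j \alpha^{j} T(\alpha^{j})}{\alpha^{j} - \alpha^{i}} + r_i \alpha^{i} T'(\alpha^{i}) = \sum_{j=0,\, j \ne i}^{n-1} \frac{F_j}{\alpha^{j} - \alpha^{i}} + r_i \alpha^{i} T'(\alpha^{i}), \tag{12}$$

where T'(x) is the derivative of T(x) and $F_j = r_j \alpha^{j} T(\alpha^{j})$. 
Note that $T'(\alpha^{i}) = \prod_{j \in U_0,\, j \ne i} (\alpha^{i} - \alpha^{j})$. It is easy to see that 
$S^{(\ell)}(\alpha^{i})$ is not related to any $r_j$, where $j \in U_0$ and $j \ne i$. 
Hence, $S^{(\ell)}(\alpha^{i}) = S^{(\ell-1)}(\alpha^{i})$ for all $i \in U_0 \setminus U_{\ell-1}$. This fact 
implies that all sampled values in previous iterations can be 
directly used in the current iteration of the W-B algorithm. 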

Define $\mathrm{rank}[N(x), W(x)] = \max[2\deg(W(x)), 1 + 2\deg(N(x))]$. 
The incremental RS decoding algorithm is 
described in Algorithm 2. Upon success, the incremental 
RS decoding algorithm returns k non-error symbols. The 
procedure will report failure either as the result of a mismatched 
degree of the error locator polynomial, or an insufficient number 
of roots found by Chien search. In both cases, no 
further erasure decoding is required. This reduces the decoding 
computation time. 



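Algorithm 2: Incremental RS Decoding (IRD) 
init : Calculate $F_j$ given in (12) for all $j \notin U_0$. 
       $\ell \leftarrow 0$; $\Lambda^{(0)}(x) \leftarrow 1$; $\Omega^{(0)}(x) \leftarrow 0$; 
       $\Phi^{(0)}(x) \leftarrow 0$; $\Theta^{(0)}(x) \leftarrow 1$. 
input : stage $\ell$, two new symbols at storage nodes 
        $(j_1^{(\ell)} + 1)$ and $(j_2^{(\ell)} + 1)$ 
output: FAIL or non-error symbols r 
begin 
    foreach i = 1, 2 do 
        $x_i^{(\ell)} \leftarrow \alpha^{j_i^{(\ell)}}$ and $y_i^{(\ell)} \leftarrow S^{(\ell)}(x_i^{(\ell)})$; 
    end 
    for i = 1 to 2 do 
        $b_i^{(\ell-1)} \leftarrow \Omega^{(\ell-1)}(x_i^{(\ell)}) - y_i^{(\ell)} \Lambda^{(\ell-1)}(x_i^{(\ell)})$; 
        $a_i^{(\ell-1)} \leftarrow \Theta^{(\ell-1)}(x_i^{(\ell)}) - y_i^{(\ell)} \Phi^{(\ell-1)}(x_i^{(\ell)})$; 
        if $b_i^{(\ell-1)} = 0$ then 
            $\Lambda^{T}(x) \leftarrow \Lambda^{(\ell-1)}(x)$; $\Omega^{T}(x) \leftarrow \Omega^{(\ell-1)}(x)$; 
            $\Theta^{T}(x) \leftarrow (x - x_i^{(\ell)}) \Theta^{(\ell-1)}(x)$; 
            $\Phi^{T}(x) \leftarrow (x - x_i^{(\ell)}) \Phi^{(\ell-1)}(x)$; 
        else 
            $\Theta^{T}(x) \leftarrow (x - x_i^{(\ell)}) \Omega^{(\ell-1)}(x)$; 
            $\Phi^{T}(x) \leftarrow (x - x_i^{(\ell)}) \Lambda^{(\ell-1)}(x)$; 
            $\Omega^{T}(x) \leftarrow b_i^{(\ell-1)} \Theta^{(\ell-1)}(x) - a_i^{(\ell-1)} \Omega^{(\ell-1)}(x)$; 
            $\Lambda^{T}(x) \leftarrow b_i^{(\ell-1)} \Phi^{(\ell-1)}(x) - a_i^{(\ell-1)} \Lambda^{(\ell-1)}(x)$; 
        end 
        if $\mathrm{rank}[\Omega^{T}(x), \Lambda^{T}(x)] > \mathrm{rank}[\Theta^{T}(x), \Phi^{T}(x)]$ then 
            swap $[\Omega^{T}(x), \Lambda^{T}(x)] \leftrightarrow [\Theta^{T}(x), \Phi^{T}(x)]$; 
        end 
        if i = 1 then 
            $\Omega^{(\ell-1)}(x) \leftarrow \Omega^{T}(x)$; $\Lambda^{(\ell-1)}(x) \leftarrow \Lambda^{T}(x)$; 
            $\Theta^{(\ell-1)}(x) \leftarrow \Theta^{T}(x)$; $\Phi^{(\ell-1)}(x) \leftarrow \Phi^{T}(x)$; 
        else 
            $\Omega^{(\ell)}(x) \leftarrow \Omega^{T}(x)$; $\Lambda^{(\ell)}(x) \leftarrow \Lambda^{T}(x)$; 
            $\Theta^{(\ell)}(x) \leftarrow \Theta^{T}(x)$; $\Phi^{(\ell)}(x) \leftarrow \Phi^{T}(x)$; 
        end 
    end 
    if the degree of $\Lambda^{(\ell)}(x)$ does not match the expected degree then 
        return FAIL; 
    end 
    NumErrorLoc = ChienSearch($\Lambda^{(\ell)}(x)$); 
    if NumErrorLoc > n − k || NumErrorLoc $\ne \deg(\Lambda^{(\ell)}(x))$ then 
        return FAIL; 
    end 
    return k non-error symbols r; 
end 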



5 Complexity Analysis 

This section provides a complexity analysis for data storage 
and retrieval in Sections 5.1 and 5.2, respectively. Both have 
computational and communication costs associated with them. 
Specifically, data storage is composed of an encoding phase and a 
data dissemination phase, while data retrieval is composed of 
an incremental collection phase and a decoding phase. Section 5.2 
also includes Monte-Carlo simulations that are consistent 
with our data retrieval complexity analysis. Finally, Section 5.3 
discusses the benefit of relaxing the constraint, 
imposed by our previous work [13], that k equal the number of data nodes. 

5.1 Data Storage 

From Section 3.1, the communication cost incurred by the 
encoded data generated by a data node is $n m \lceil \lceil T/m \rceil / k \rceil$ bits. 
The total communication cost is larger by a factor equal to the number of data nodes. Also, 
it is easy to see that the total number of bits stored in each storage node is 
$m \lceil \lceil T/m \rceil / k \rceil$ multiplied by the number of data nodes, 
which is approximately T when k equals the number of data nodes and T is 
much larger than mk. Assuming a software implementation of 
field operations without the use of look-up tables, the computation 
complexity of encoding can be estimated as follows. 
One multiplication in $GF(2^m)$ costs $m^2$ bit 
exclusive-ORs. At the data node, $k n \lceil \lceil T/m \rceil / k \rceil$ multiplications 
are performed, which is equivalent to 
$k n m^2 \lceil \lceil T/m \rceil / k \rceil$ bit exclusive-ORs. 
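The storage-side costs can be tabulated with a few lines of code. The sketch below assumes the cost expressions read as follows: $\lceil \lceil T/m \rceil / k \rceil$ groups per data node, n m bits sent per group, k n multiplications per group, and $m^2$ bit XORs per multiplication. The function and key names are illustrative.

```python
def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def storage_costs(T: int, m: int, k: int, n: int, data_nodes: int) -> dict:
    """Per-block costs for T data bits, m-bit symbols and an (n, k) RS code."""
    groups = ceil_div(ceil_div(T, m), k)  # ceil(ceil(T/m) / k) information groups
    return {
        "groups": groups,
        "bits_sent_per_data_node": n * m * groups,
        "bits_sent_total": data_nodes * n * m * groups,
        "bits_stored_per_storage_node": data_nodes * m * groups,
        "multiplications_per_data_node": k * n * groups,
        "xors_per_data_node": k * n * m * m * groups,
    }


# Example values (assumed): 8000-bit blocks, 10-bit symbols, k = 100 = number of data nodes.
costs = storage_costs(T=8000, m=10, k=100, n=1023, data_nodes=100)
```

With k equal to the number of data nodes, each storage node ends up holding about T bits, matching the approximation stated in the text.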



5.2 Data Retrieval 

Given a set of coded symbols, Section 5.2.1 analyzes the 
computational decoding costs. Then a derivation of the communication 
complexity is provided in Section 5.2.2. 

5.2.1 Decoding 

In the subsequent complexity analysis, the worst case is 
assumed, namely, no decoding failure is reported in Algorithm 2, 
and the algorithm runs to completion. 

In CRC checking, one polynomial division is performed. 
Since the dividend is of degree T − 1 and the divisor is of 
degree r, the computation complexity is O(Tr). 

Let v be the number of errors when the decoding procedure 
is completed. In the $\ell$-th iteration, $\ell$ errors are assumed in the 
data and the number of erasures is $n - k - 2\ell$. We first need to 
calculate two syndrome values. These can be obtained from the $F_j$ 
calculated initially. For instance, in the first iteration, according 
to (12), the computation complexity is $O(k(n - k))$ since 
there are k $F_j$'s to be calculated and each is a product of 
n − k terms. In the next iteration, two more symbols are added. 
Hence, the updated syndrome values can be obtained 
by an extra $O(k) + O(n - k)$ computations. To find the error-locator 
polynomial, the W-B algorithm is performed for two steps 
in each iteration with complexity $O(\ell)$. Since we only consider 
a software implementation, the Chien search can be replaced by 
substituting powers of $\alpha$ into the error-locator polynomial. 
It needs to test at most $k + \ell$ positions to locate k non-error 
positions, so it takes $O((k + \ell)\ell)$ computations. 
Finally, inversion of a Vandermonde matrix $\hat{G}$ can be done in 
$O(k \log^2 k)$ [?], though for implementation purposes, we use 
an $O(k^2)$ inversion algorithm (see, e.g., [?]). In summary, the 
computation in the $\ell$-th iteration for $\ell \ge 1$ is 

$$L_v(\ell) = O(k^2) + O(n - k) + O(k\ell + \ell^2).$$

Counting v iterations and the complexity of calculating the $F_j$, 
we have 

$$L_v = O(v k^2) + O(k(n - k)) + O(v^2 k) + O(v(n - k)) + O(v^3). \tag{13}$$

Note that the computation complexity is measured in finite 
field multiplications, each of which is equivalent to $m^2$ bit exclusive-ORs. 
Since the correctable number of errors v is at most 
(n − k)/2, the decoding complexity is at most $O(k(n - k)^2)$. 
For small v, the second term $O(k(n - k))$ dominates, which 
corresponds to syndrome computation. Note that the decoding 
procedure needs to be performed $\lceil \lceil T/m \rceil / k \rceil$ times in order to 
decode all data. 

5.2.2 Communication 

In this section, we provide a probabilistic analysis of the cost 
of communication by determining the number of stages the 
algorithm needs to take, and the probability of successful 
execution. Given n storage nodes and an (n, k)-MDS code, the 
minimum and maximum numbers of storage nodes to access 
in the proposed scheme are k and n, respectively. We assume 
that the CRC code always detects an error if one occurs. Without 
loss of generality, we assume that all failures are Byzantine 
failures, since s crash-stop failures can be easily modeled by 
replacing n with n − s. An important metric of the decoding 
efficiency is the average number of accessed storage nodes 
when the probability of compromising each storage node is p. 
Failure to recover data correctly may occur in two cases. First, 
$v > n - k$, i.e., there is an insufficient number of healthy storage 
nodes. Second, $\lfloor (n - k)/2 \rfloor < v \le n - k$, in which case the sequence of 
accesses determines the outcome (success or failure) of the 
decoding process. For example, if the first v nodes accessed 
are all compromised nodes, correct decoding is impossible. In 
both cases, the decoding algorithm stops after n accesses and 
declares a failure. The communication cost is n. The main 
result is summarized in the following theorem. 

Theorem 1: With the progressive data retrieval scheme, the 
average number of accesses as well as the probability of 
successful decoding are given in Eqs. (14) and (15), respectively. 

The details of the proof can be found in the Appendix, 
and numerical backing of this analysis is illustrated in what 
follows. 

Numerical Corroboration: We verify the correctness of the
analytical model using Monte-Carlo simulations implemented
in Matlab. Figure 2 shows the distribution of the number
of storage nodes accessed when the algorithm terminates;
the number of iterations corresponds to the number of
node accesses during data retrieval. The bar charts depict
histograms from Monte-Carlo simulations with 5000 runs, and
the curves represent the numerical results from our analytical
model. We choose $n = 127$, $\bar{k} = 30$ and $p = 0.2$ so
that 5000 runs give sufficient statistics in the simulations.
From Figure 2, it can be observed that the analytical results
agree well with the Monte-Carlo simulations. Note that the
number of information symbols, $k$, yields different distribution
results. Specifically, increasing $k$ reduces storage, as derived
in Section 5.1. However, Figure 2 shows that the expense in
terms of data retrieval is undesirably high.

[Figure 2: Monte-Carlo histograms and numerical curves; horizontal axis: number of iterations.]

Fig. 2. The effect of $k$ on the data retrieval cost for a
$(127, k)$-MDS; $\bar{k} = 30$. The error probability here is $p = 0.2$;
$k \in \{\bar{k}/2, \bar{k}, 2\bar{k}\}$.
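The Monte-Carlo experiment described above can be sketched as follows. This is a minimal Python illustration, not the Matlab code used for the reported results. It assumes the stopping rule implied by the analysis: iteration $i$ succeeds as soon as at most $i$ of the first $k + 2i$ accessed nodes are compromised, and a failed run costs $n$ accesses. The function names and the seed are hypothetical.

```python
import random


def accesses_until_stop(n, k, bad_flags):
    """Return (nodes accessed, success flag) for one access order.

    bad_flags[j] is True if the j-th accessed node returns corrupted data.
    Iteration i succeeds once at most i of the first k + 2i nodes are bad.
    """
    bad_seen = sum(bad_flags[:k])
    i = 0
    while True:
        if bad_seen <= i:
            return k + 2 * i, True
        if k + 2 * (i + 1) > n:
            # Not enough nodes left for another iteration: declare failure.
            return n, False
        # Fetch two more symbols to correct one more error.
        bad_seen += sum(bad_flags[k + 2 * i : k + 2 * i + 2])
        i += 1


def simulate(n, k, p, runs, seed=0):
    """Estimate the average number of accesses and the success rate."""
    rng = random.Random(seed)
    total, successes = 0, 0
    for _ in range(runs):
        # Each node is compromised independently with probability p.
        flags = [rng.random() < p for _ in range(n)]
        cost, ok = accesses_until_stop(n, k, flags)
        total += cost
        successes += ok
    return total / runs, successes / runs


if __name__ == "__main__":
    print(simulate(127, 30, 0.2, 5000))
```

Since nodes are compromised independently, accessing them in index order is statistically equivalent to accessing them in a random order, which keeps the sketch short.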
Fig. 3. Number of storage nodes accessed as a
function of the probability of malicious attacks for a
$(1023, k)$-MDS; $\bar{k} = 101$.


Next, we fix $n = 1023$, and vary the number of information
symbols, $k$, from 101 to 401 and the error probability $p$ from
0 to 0.3, while keeping the number of data nodes constant.
Figure 3 shows the increasing communication cost as the
probability of failures increases. The number of crash-stop
failures is set to zero, and all Byzantine failures result in
incorrect data. Clearly, when the error probability $p$ is small,
the communication cost is close to $k$, and as $p$ increases,
the communication cost increases monotonically, as expected.
We also analyze the success rate of decoding and observe that
for $k \in \{101, 201, 301\}$ and $p \in \{0, 0.05, \ldots, 0.3\}$, decoding
is always successful. However, for $k = 401$, decoding
is always successful only for $p \in \{0, 0.05, \ldots, 0.25\}$. When
$p = 0.3$, the probability of successful decoding is only 60%.
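$$
\begin{aligned}
N(n,k) ={}& \sum_{v=0}^{n-k} \binom{n}{v} p^v (1-p)^{n-v} \sum_{i=0}^{\min\left(v, \left\lfloor \frac{n-k}{2} \right\rfloor, n-v-k\right)} (k+2i)\, \frac{\binom{n-v}{i+k-1}\binom{v}{i}}{\binom{n}{2i+k-1}} \cdot \frac{k}{i+k} \cdot \frac{n-v-(i+k-1)}{n-(2i+k-1)} \\
&+ n \sum_{v=0}^{n-k} \binom{n}{v} p^v (1-p)^{n-v} \left( 1 - \sum_{i=0}^{\min\left(v, \left\lfloor \frac{n-k}{2} \right\rfloor, n-v-k\right)} \frac{\binom{n-v}{i+k-1}\binom{v}{i}}{\binom{n}{2i+k-1}} \cdot \frac{k}{i+k} \cdot \frac{n-v-(i+k-1)}{n-(2i+k-1)} \right) \\
&+ n \sum_{v=n-k+1}^{n} \binom{n}{v} p^v (1-p)^{n-v}
\end{aligned}
\tag{14}
$$

$$
\Pr(\text{success}) = \sum_{v=0}^{n-k} \binom{n}{v} p^v (1-p)^{n-v} \sum_{i=0}^{\min\left(v, \left\lfloor \frac{n-k}{2} \right\rfloor, n-v-k\right)} \frac{\binom{n-v}{i+k-1}\binom{v}{i}}{\binom{n}{2i+k-1}} \cdot \frac{k}{i+k} \cdot \frac{n-v-(i+k-1)}{n-(2i+k-1)}
\tag{15}
$$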
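As a hedged illustration of how Eq. (14) and Eq. (15) can be evaluated numerically, the sketch below computes both quantities with exact rational arithmetic. It relies on the per-iteration stopping probability derived in the Appendix (Eq. (18)). The function names are hypothetical, and this is not the code used to produce the reported curves.

```python
from fractions import Fraction
from math import comb


def pr_stop(n, k, v, i):
    """Pr(B_i | A_v): decoding first succeeds at iteration i given v bad nodes."""
    first = 2 * i + k - 1          # accesses made before the final one
    good = i + k - 1               # healthy nodes among those accesses
    if good > n - v or i > v or first >= n:
        return Fraction(0)
    cond_i = Fraction(comb(n - v, good) * comb(v, i), comb(n, first))
    cond_ii = Fraction(k, i + k)                 # reflection-principle factor
    cond_iii = Fraction(n - v - good, n - first)  # last access is healthy
    return cond_i * cond_ii * cond_iii


def expected_accesses_and_success(n, k, p):
    """Return (average number of accesses, probability of successful decoding)."""
    p = Fraction(p)
    avg, succ = Fraction(0), Fraction(0)
    for v in range(n + 1):
        pr_v = comb(n, v) * p**v * (1 - p) ** (n - v)
        if v > n - k:
            avg += n * pr_v        # too few healthy nodes: n accesses, failure
            continue
        i_max = min(v, (n - k) // 2, n - v - k)
        stops = [pr_stop(n, k, v, i) for i in range(i_max + 1)]
        avg += pr_v * sum((k + 2 * i) * q for i, q in enumerate(stops))
        avg += pr_v * n * (1 - sum(stops))       # unsuccessful access orders
        succ += pr_v * sum(stops)
    return avg, succ


if __name__ == "__main__":
    avg, succ = expected_accesses_and_success(127, 30, Fraction(1, 5))
    print(float(avg), float(succ))
```

Passing `p` as a `Fraction` keeps every intermediate value exact; a float can also be passed, in which case its exact binary value is used.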



5.3 The Dynamics of k 

One advantage of the proposed scheme is that the number
of information symbols, $k$, is not tied to the number of data
nodes, $\bar{k}$. Hence, one may choose appropriate values of $k$
and $\bar{k}$ for any given application. For example, in wireless
sensor networks, data nodes are power-limited but the
data collector typically has no power constraint. In such
applications, one should reduce the cost to disseminate coded
symbols from data nodes to storage nodes, and shift the cost
to the data collection phase. This can be done by choosing
$k > \bar{k}$. As shown in Section 5.1, the total communication cost
is

$$\bar{k}\, n\, m \left\lceil \frac{T}{mk} \right\rceil,$$

so this cost varies roughly linearly with the ratio of $\bar{k}$
to $k$. If one takes $k = 2\bar{k}$, then the dissemination cost is half
that of $k = \bar{k}$. However, from our analytical
model, we see that the data retrieval cost is then doubled.
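The trade-off can be checked with a few lines of arithmetic. The sketch below assumes the dissemination cost has the form $\bar{k}\, n\, m \lceil T/(mk) \rceil$ bits, i.e., each of the $\bar{k}$ data nodes sends one $m$-bit coded symbol per group to each of the $n$ storage nodes. The parameter values are illustrative only and are not taken from the experiments.

```python
def dissemination_cost_bits(num_data_nodes: int, n: int, m: int, T: int, k: int) -> int:
    """Total bits sent from data nodes to storage nodes (assumed cost model)."""
    groups_per_node = -(-T // (m * k))   # integer ceiling of T / (m * k)
    return num_data_nodes * n * m * groups_per_node


if __name__ == "__main__":
    k_bar, n, m, T = 100, 1023, 10, 20000   # hypothetical values
    same = dissemination_cost_bits(k_bar, n, m, T, k=k_bar)
    doubled = dissemination_cost_bits(k_bar, n, m, T, k=2 * k_bar)
    print(same, doubled, same / doubled)
```

With these values $T$ is a multiple of $mk$ in both cases, so the ceiling has no effect and the ratio of the two costs is exactly 2.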

6 Relative Analysis 

In this section, we compare the proposed scheme (IRD) to
Decentralized Erasure Codes (DEC) proposed by Dimakis et
al. [?], and Decentralized Fountain Codes (DFC) proposed by
Lin et al. [?]. To make this comparison, we assume that there
are $\bar{k}$ data nodes that contain the data to be redundantly stored
among $n$ storage nodes. The $\bar{k}$ data nodes collectively generate
$\bar{k} T_0$ bits, where there are $T_0$ bits per node. As mentioned
in Section 3.1, each IRD data node adds $r$ CRC bits to its
data. For brevity, we therefore write $T = T_0 + r$ for the
data size of each IRD data node. To facilitate understanding,
we set $k = \bar{k}$ in all schemes utilizing information symbols.

In this analysis, the data held by data nodes and storage nodes can be
partitioned into data symbols and storage symbols, respectively. The
number of bits in a data symbol is always $m$. Depending on
the storage scheme, a storage symbol may be larger than $m$ bits.
Multiplication of an $m_1$-bit symbol and an $m_2$-bit symbol, $m_1 \ge m_2$,
requires $m_1 m_2$ XORs, while an addition requires only $m_1$,
where the field of operation for the symbols is $\mathbb{F}_{2^{m_1}}$.

Similar to Section 3.1, encoding and decoding for all
schemes are done in groups. DEC and DFC both have one data
symbol per data node per group, while IRD encodes $k$ data
symbols per data node per group. Consequently, each data
node in DEC/DFC and IRD has $\frac{T_0}{m}$ and $\frac{T}{mk}$ groups,
respectively. As we will see, although fountain
coding minimizes the encoding and decoding complexities,
IRD significantly reduces communication, especially for
decoding.

Quantifying the performance of storage codes requires a
comparison over the metrics shown in Table 1. Respectively,
these metrics are: the $n$ storage nodes' redundancy (total
bits stored); the storage nodes' overhead required for decoding;
the dissemination cost (communication cost between data and
storage nodes); the collection cost for $1/k$ and for all of the
original data, respectively; the encoding computation cost; a data
collector's decoding computation cost; the ability to detect and
correct errors; and finally, the ability to deterministically
guarantee reconstruction of the original data. We now provide a
qualitative comparison among the three storage schemes, based
on these metrics.

6.1 Decentralized erasure codes 

DEC have been applied in wireless or wired networks, where
a data collector accesses any $k$ out of $n$ storage nodes for
data reconstruction. Each storage node selects random and
independent coefficients in a finite field $\mathbb{F}_q$, and stores a linear
combination of all the received data (modulo $q$). Randomized
linear codes are used, where each data node routes its packet
multiple times to at least $\frac{n}{k}\log k$ storage nodes, so that the $n$
storage nodes collectively store $\left(\frac{n}{k}\log k\right) k T_0$ bits.

The storage complexity per storage node is $\frac{T_0}{m} \cdot \log q$ bits,
which holds because i) arithmetic is done in $\mathbb{F}_q$, which means
each stored symbol has $\log q$ bits, and ii) there are $\frac{T_0}{m}$ groups.

The overhead can be calculated as follows. Since each
storage node stores the linear combination coefficients and
there are $\log k$ data nodes connected to a storage node,
the overhead (to store the coefficients) per storage node is
$\log k \left(\log q + \log k + \log \frac{T_0}{m}\right)$ bits. For any storage node, note
that the $\log k$ outside the parentheses denotes the number of
data nodes connecting to the storage node, while the $\log k$ term
inside the parentheses is the number of bits needed to identify any
connecting data node [4]. The last parenthesized term identifies the coding
group.

The DEC dissemination cost is given by

$$k \cdot \frac{n}{k}\log k \cdot \frac{T_0}{m} \cdot m \text{ bits}$$

because i) there are $k$ data nodes, ii) each data node repeatedly
sends its data out $\frac{n}{k}\log k$ times, and iii) there are $\frac{T_0}{m}$ groups.

TABLE 1

Performance comparison of erasure coding schemes for storage

Since the code structure is inherently random, $k$ nodes must
be contacted in order to obtain any one symbol. Specifically,
$k$ symbols are collected from $k$ storage nodes to reconstruct
the generated data, per group.

The encoding cost per storage node is given by

$$(m \log q) \cdot \log k \cdot \frac{T_0}{m} \text{ XORs}$$

because i) the cost of a linear combination (multiplication) is
$m \log q$ bit operations, since a combination coefficient and a
data symbol are $\log q$ and $m$ bits, respectively, ii) a storage
symbol is the result of $\log k$ linear combinations, and iii)
there are $\frac{T_0}{m}$ groups to be encoded. Similarly, the decoding
complexity to reconstruct the entire data is given by

$$(\log^2 q) \cdot k^3 \cdot \frac{T_0}{m} \text{ XORs.}$$

This complexity can be derived from the following: i) a
multiplication costs $\log^2 q$ bit operations, since each storage
node stores $\log q$-bit symbols, ii) matrix inversion is performed,
at a cost of $k^3$ multiplications, and iii) there are $\frac{T_0}{m}$ groups to be decoded.

Although DEC can be efficiently constructed, error
detection and correction are both infeasible: assuming the use
of CRC for error detection, a data collector must continue to
enumerate all possible choices of $k$ symbols from $k$ out of $n$ storage
nodes until the original data can be reconstructed correctly.
Therefore, DEC cannot be applied to dependable storage
systems where data integrity is desired in the midst of errors
and malicious users.

6.2 Decentralized fountain codes 

DFC are decentralized LT codes [18], and were proposed
for the special purpose of guaranteeing data availability in
the presence of crash-stop failures for networks with several
data generators and storage nodes. Storage node $s_i$, where
$i = 1, \ldots, n$, chooses a degree $d_i$, which is defined as the number
of data symbols from which to form a linear combination.
$s_i$ then linearly combines $d_i$ data symbols, using the XOR
operation, from $d_i$ arbitrarily chosen data generators. DFC,
like other fountain codes, trades communication for computation:
decoding requires more than $k$ storage nodes to be
contacted, though both encoding and decoding computations
are linear in the number of original symbols. In performance
evaluation, we assume that $1 - \delta$ is the probability of successful
decoding. Unlike in DEC, instead of pulling data from
candidate data nodes, a deterministic and probabilistic scheme
is devised to push data from the source nodes to storage
nodes [?]. Aside from using fountain codes, the authors use
random walks to remove the need for a geometric routing
protocol for propagating data from data nodes to storage nodes.

The storage complexity per storage node is $\frac{T_0}{m} \cdot m$ bits
because there are $\frac{T_0}{m}$ groups and $m$ bits per storage symbol.
Note that, unlike DEC, the number of bits in a storage symbol
is independent of the size of the operating field.

The overhead complexity per storage node is given by

$$\frac{T_0}{m} \cdot \log\frac{k}{\delta} \cdot \left(\log k + \log\frac{T_0}{m}\right) \text{ bits.}$$

The derivation here is similar to that of DEC with the
following differences: the average degree of a storage symbol
is $\log\frac{k}{\delta}$ and there are no linear combination coefficients, since
every linear combination is simply an XOR of a set of data
symbols.

The dissemination cost for DFC is given by

$$n \cdot \log\frac{k}{\delta} \cdot \frac{T_0}{m} \cdot m \text{ bits.}$$

This holds because i) there are $n$ storage nodes, and ii) each
node combines $\log\frac{k}{\delta}$ data symbols on average.

DFC is an LT code that is not systematic. Also, to reconstruct
any data, more than $k$ nodes must be contacted. Specifically,
the communication cost to collect all data symbols is
given by

$$T_0 \left(k + \sqrt{k}\,\log^2\frac{k}{\delta}\right) \text{ bits.}$$

This holds because $k + \sqrt{k}\log^2\frac{k}{\delta}$ symbols must be collected
for successful data reconstruction of any $k$ symbols.

Encoding and decoding can be very efficient in DFC. The
encoding complexity per storage node is given by

$$m \cdot \log\frac{k}{\delta} \cdot \frac{T_0}{m} \text{ XORs}$$

since i) the cost of an XOR of two $m$-bit symbols is $m$ XORs
and ii) each encoded symbol is the XOR of $\log\frac{k}{\delta}$ $m$-bit data
symbols, where $\log\frac{k}{\delta}$ is the average degree of an encoded
symbol. In a similar manner, the decoding complexity for DFC is
given by

$$\frac{T_0}{m} \cdot k\log\frac{k}{\delta} \cdot m \text{ XORs}$$

because DFC uses the $O\left(k\log\frac{k}{\delta}\right)$ belief propagation algorithm
for decoding. DFC is most efficient in decoding. However,
like all fountain codes, the decoding efficiency comes at a
communication trade-off that is determined by the parameter
$\delta$.

6.3 IRD 

Each data node in IRD encodes its own data, with $k$ symbols
per group, to the $n$ storage nodes. Therefore, altogether there
are $\frac{T}{mk}$ groups. Note that IRD has the same storage complexity
as DFC even though the numbers of groups and data symbols
per group differ. Because IRD utilizes a code structure known
to all storage nodes, it has the minimum overhead complexity
per storage node:

$$\frac{T}{mk} \cdot k\left(\log k + \log\frac{T}{mk}\right) \text{ bits.}$$

Unlike DEC and DFC, IRD does not replicate transmissions
to storage nodes, and therefore its dissemination cost exactly
equals its storage complexity, leaving IRD with the minimum
dissemination cost as well. IRD is also preferable because it
is systematic, allowing partial collection of a subset of data.
Since $k$ storage nodes store the data nodes' data in original
form, any one of these $k$ storage nodes can be contacted to
collect a $1/k$ portion of the original data, for all groups.
This is particularly important where not every sequence of
generated data is immediately significant to a data collector.

Encoding for all groups and all data nodes yields

$$k \cdot m^2 \cdot nk \cdot \frac{T}{mk} \text{ XORs}$$

because i) each node performs encoding, ii) we use a classical
matrix multiplication, and iii) all groups are encoded. From
Section 5.2.1, and iterating over all groups, the decoding
complexity is $Tmk^2$ XORs in the absence of erroneous storage
nodes, and is given by

$$Tm\left(vk^2 + k(n - k) + v^2 k + v(n - k) + v^3\right) \text{ XORs}$$

in the presence of $v$ erroneous storage nodes. Note that IRD is
also the only scheme of the three that detects and efficiently
corrects errors. Moreover, IRD adapts decoding computation
to the number of erroneous nodes. Finally, neither DEC nor
DFC is suitable for real-time dependable applications, since
they are not deterministic. That is, their ability to decode is
probabilistic and cannot be guaranteed.

7 Implementation and Evaluation 

We have implemented the proposed and baseline algorithms,
where each data node's information is a memory buffer in
a single machine with a 2.66 GHz Intel Xeon CPU, 4096
KB cache and 2 GB RAM. A randomly generated message
is first partitioned into either 101 or 401 information symbols
and then encoded into $n = 1023$ coded symbols of length
10230 bits. Thus, the field size is $2^{10} = 1024$. A stored
symbol is corrupted independently with error probability $p$.
Comparing our error-erasure code to either decentralized or
fountain erasure codes for error-correction performance is
pointless, since these codes cannot feasibly guarantee data
availability in the presence of errors. Therefore, in this section,
the following three algorithms are considered.

• BMA is the Berlekamp-Massey algorithm (BMA) for
RS decoding. Similar to Algorithm [?], BMA progressively
retrieves data from each storage node and performs
decoding until the decoded symbols pass the CRC
checks or failure is declared. However, decoding cannot
be performed incrementally.

• BMA-genie knows a priori how many symbols are needed
to successfully decode. BMA-genie decodes only once
after retrieving a sufficient number of symbols. Note that
BMA-genie is impossible to implement in practice, and
is included for comparison purposes only.

• IRD is the proposed progressive data retrieval algorithm.

7.1 Total computation time 

Figures 4(a) and 4(b) illustrate the computation time (in log
scale) spent in decoding when $k = 101$ and $k = 401$,
respectively. Note that $k = \bar{k}$ in these simulations. The storage
overhead $n/k$ is 10.13 and 2.55, with the maximum number
of correctable errors being 461 and 311, respectively. From Figure 4, we
observe that the BMA and IRD computation times increase as
$p$ increases, but the rate of increase in IRD is much slower.
When $k$ is small or the redundancy is higher (Figure 4(a)), IRD
is faster than the genie-aided BMA. This is because in the
genie-aided BMA, the computations of erasure polynomials
(with complexity $O((n - k)^2)$) dominate the decoding time when $p$ is
small. In contrast, IRD does not compute erasure polynomials.
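The storage overhead and error-correction capability quoted above follow directly from the code parameters. The short sketch below, with hypothetical helper names, reproduces that arithmetic for the two configurations, assuming that an $(n, k)$-MDS code corrects at most $\lfloor (n-k)/2 \rfloor$ errors.

```python
def storage_overhead(n: int, k: int) -> float:
    """Ratio of coded symbols to information symbols."""
    return n / k


def max_correctable_errors(n: int, k: int) -> int:
    """Largest number of erroneous symbols an (n, k)-MDS code can correct."""
    return (n - k) // 2


if __name__ == "__main__":
    for k in (101, 401):
        print(k, round(storage_overhead(1023, k), 2), max_correctable_errors(1023, k))
```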

In Section 5, the data collection costs were shown to depend
on $k$. We quantified the encoding and decoding computational
complexity. The evaluation results, for a given Byzantine node
rate $p = 0.2$, $\bar{k} = 200$ data nodes, and a $(1023, k)$-MDS code,
where $k \in \{50, 100, \ldots, 550\}$, are consistent with the analysis
in Section 5: encoding computational costs are invariant of $k$
and are relatively insignificant. However, decoding computational
costs increase according to Eq. (13).

7.2 Decoding Breakdown 

We break down the decoding computation time to understand
the dominant operations in the algorithms as the error
probability increases. The breakdown includes the time to find
the error-locator polynomial (elp-time), find the error locations
(chien-time) and solve for the information polynomial
(inv-mat-time). This breakdown is illustrated in Figure 5,
where the 1st and 2nd blocks show the elp-time, while the
3rd and 4th blocks give the chien-time and inv-mat-time,
respectively.

When the error probability is low (Figure 5(a)), computation
of error-locator polynomials dominates for small $k$, while the
matrix inversion time becomes significant when $k$ is large. In
our implementation, the cost of a matrix inversion is quadratic
in the number of symbols decoded. Although Chien search
is asymptotically the most time-consuming procedure, it can
be performed quite fast. When the error probability is high
(Figure 5(b),(c)), computation of error-locator polynomials
dominates except in IRD. Also, from Figure 5, we observe that
the computation time in matrix inversion is almost negligible
(on the order of tens of milliseconds) in BMA and IRD, and
is comparable to that in BMA-genie (recall that BMA-genie
knows the number of errors in advance and thus performs
matrix inversion only once). This is because even though there
are more errors with larger $p$ (and thus more iterations), the
decoding algorithm is likely to fail in or before Chien search
(e.g., Algorithm 2, Line [?]). Thus, in most cases, BMA and
IRD perform matrix inversion once.

From the experiments, IRD is more efficient because it
utilizes intermediate results from previous iterations. A speedup
of up to 20 times relative to BMA can be attained.



8 Conclusions 

We have developed a communication-optimal algorithm to 
guarantee data dependability and availability for distributed 
storage systems, in the midst of Byzantine-faulty and crash- 
stop nodes. The communication cost for data retrieval is 
minimized by utilizing intermediate computation results and 
collecting only the minimum data required for successful data 
reconstruction. The efficient encode and decode primitives 
serve as a fundamental building block for distributed depend- 
able storage systems. An analytical model to evaluate the 
communication complexity of our incremental data retrieval 
algorithm is provided. Moreover, our previous work restricted
$k = \bar{k}$, a constraint that is unrealistic for distributed networked
storage systems. In this paper, the constraint is eliminated,
and $k$ is independent of the number of data nodes, $\bar{k}$. Finally,
our implementation results show that our progressive scheme 
outperforms the state-of-the-art scheme by a factor of 20 in 
computation costs. Moreover, they are consistent with our 
analytical model, for any k and any Byzantine node rate. 



Appendix 

Proof of Theorem 1

Let $A_v$ be the event that there are $v$ compromised storage
nodes in the network, and $B_i$ be the event that the error-erasure
decoding algorithm executes $i$ times when it completes
successfully. Note that $i$ also indicates how many errors
the error-erasure decoding algorithm has corrected, since our
proposed scheme asks for extra data to correct one more error
in each iteration.

Therefore, the average number of accesses is given as

$$
\begin{aligned}
N(n,k) ={}& \sum_{v=0}^{n-k} \Pr(A_v) \sum_{i=0}^{\min\left(v, \left\lfloor \frac{n-k}{2} \right\rfloor, n-v-k\right)} (k+2i) \Pr(B_i \mid A_v) \\
&+ \sum_{v=0}^{n-k} n \Pr(A_v) \left( 1 - \sum_{i=0}^{\min\left(v, \left\lfloor \frac{n-k}{2} \right\rfloor, n-v-k\right)} \Pr(B_i \mid A_v) \right) \\
&+ \sum_{v=n-k+1}^{n} n \Pr(A_v).
\end{aligned}
\tag{16}
$$

The first term gives the average number of accesses in a
successful run. The third and second terms correspond to
the first and second failure cases discussed in Section 5.2.2,
respectively. $\Pr(A_v)$ is simply given as

$$\Pr(A_v) = \binom{n}{v} p^v (1-p)^{n-v}.$$

To determine $\Pr(B_i \mid A_v)$, we first derive $\Pr(B_0 \mid A_v)$, i.e.,
the probability that only erasure decoding is needed. Clearly,
$B_0$ occurs if the first $k$ copies of data are from healthy nodes.
Therefore, $\Pr(B_0 \mid A_v) = \binom{n-v}{k} / \binom{n}{k}$. Note that this result holds
even when $v > \left\lfloor \frac{n-k}{2} \right\rfloor$.

Fig. 5. Average computational time breakdown for decoding one $(1023, k)$-MDS codeword, $k \in \{101, 401\}$. Because
IRD progressively decodes, its performance does not deteriorate with an increasing Byzantine node rate, $p$.

When $i > 0$, let $A(l)$ and $B(l)$ represent the number of
erroneous data and correct data received at the data collector
after accessing $l$ live storage nodes, where $l = k + 2i$ at the end
of the $i$th iteration. Clearly, $A(l) + B(l) = l$. The event $B_i$ occurs
under the following conditions:
(i) $A(2i + k - 1) = i$ and $B(2i + k - 1) = i + k - 1$;
(ii) $B(l) - A(l) < k$, $\forall l \le 2i + k - 1$; and
(iii) $A(2i + k) = i$ and $B(2i + k) = i + k$.

Evolution of $A(l)$ and $B(l)$ can be
modeled as a lattice path from the origin $(0, 0)$ to $(i, i + k)$
in the $A$-$B$ coordinate system, recording the running totals as
more nodes are accessed. Condition (i) implies the path has to
go through the point $(i, i + k - 1)$. Condition (ii) implies that
the path never intersects the line $y = x + k$ except in the $i$th
step. Condition (iii) implies that the last retrieved datum must
come from a healthy node. The lattice path is the result of directional
random walks, where each move is conditionally independent
of the previous moves given the current coordinates. At step
$l$, the probability that $A(l + 1) = A(l) + 1$ and $B(l + 1) = B(l)$
is given by

$$\frac{v - A(l)}{n - l},$$

since there are $n - l$ remaining nodes, of which $v - A(l)$ are
compromised. The probability that $A(l + 1) = A(l)$
and $B(l + 1) = B(l) + 1$ is given by

$$\frac{n - v - B(l)}{n - l},$$

since there are $n - l$ remaining nodes, of which $n - v - B(l)$
are healthy. Therefore, the probability for
Condition (iii) to hold given Condition (i) is

$$\frac{n - v - (i + k - 1)}{n - (2i + k - 1)}.$$

The probability for Condition (i) to hold is simply

$$\frac{\binom{n-v}{i+k-1}\binom{v}{i}}{\binom{n}{2i+k-1}}.$$

Now, what remains to be derived is the probability of Condition (ii)
given Condition (i). We use the bijection argument due to Antoine
Désiré André [19], and count instead the number of "bad"
paths that hit the line $y = x + k$. Consider the point
$(i - 1, i + k)$, which is the reflection of the point $(i, i + k - 1)$ along
the line $y = x + k$. Clearly, the point $(i - 1, i + k)$ is above
the line $y = x + k$. Thus, all paths from $(0, 0)$ to $(i - 1, i + k)$
must hit the line $y = x + k$ at least once. Consider the first
time such a path $P$ hits the line $y = x + k$, at $(j, j + k)$.
After this point, the remaining number of erroneous data along
this path is $i - 1 - j$ and that of correct data is $i - j$. Now
consider a path $P'$ coinciding with $P$ up to the point $(j, j + k)$
and having exactly opposite branches to $P$ afterward. That is,
the results of data retrieval are switched afterward, namely, all
data from compromised nodes are counted toward $B(l)$ and all
data from healthy nodes are counted toward $A(l)$. As a result,
$A(l) = j + (i - j) = i$ and $B(l) = (j + k) + (i - 1 - j) = i + k - 1$
for $P'$. Thus, for any path reaching $(i - 1, i + k)$, reflecting the
remainder of the path after it first hits $y = x + k$ yields a "bad"
path to $(i, i + k - 1)$. Similarly, every "bad" path to $(i, i + k - 1)$
has a corresponding path to $(i - 1, i + k)$ that intersects
the line $y = x + k$ by construction. This establishes a
bijection between the set of "bad" paths to $(i, i + k - 1)$ and
the set of paths to $(i - 1, i + k)$. Clearly, there are $\binom{2i+k-1}{i-1}$
"bad" paths. Therefore, the probability for Condition (ii) to
hold given Condition (i) is

$$1 - \frac{\binom{2i+k-1}{i-1}}{\binom{2i+k-1}{i}} = \frac{k}{i+k}. \tag{17}$$
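The reflection argument can be sanity-checked by brute force for small parameters. The sketch below enumerates every lattice path from $(0, 0)$ to $(i, i + k - 1)$, counts those that never touch the line $y = x + k$, and compares the count with the closed form in Eq. (17). It is an illustrative check with a hypothetical function name, not part of the proof.

```python
from itertools import combinations
from math import comb


def count_good_paths(i: int, k: int) -> int:
    """Count lattice paths to (i, i + k - 1) that never touch y = x + k.

    A step in x is an erroneous datum, a step in y is a correct datum.
    """
    steps = 2 * i + k - 1
    good = 0
    for bad_positions in combinations(range(steps), i):
        bad = set(bad_positions)
        x = y = 0
        ok = True
        for s in range(steps):
            if s in bad:
                x += 1
            else:
                y += 1
            if y - x >= k:          # path has hit the line y = x + k
                ok = False
                break
        good += ok
    return good


if __name__ == "__main__":
    for i, k in [(1, 1), (2, 3), (3, 2), (4, 4)]:
        total = comb(2 * i + k - 1, i)
        # Eq. (17) predicts good / total == k / (i + k).
        print(i, k, count_good_paths(i, k), total)
```

Enumeration is exponential in the path length, so this check is only practical for small $i$ and $k$.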



To this end, we obtain the probability that the error-erasure
decoding algorithm stops at the $i$th iteration when there are $v$
compromised nodes as follows, $\forall i > 0$,

$$\Pr(B_i \mid A_v) = \frac{\binom{n-v}{i+k-1}\binom{v}{i}}{\binom{n}{2i+k-1}} \times \frac{k}{i+k} \times \frac{n-v-(i+k-1)}{n-(2i+k-1)}. \tag{18}$$

We now have the average number of accesses and the
probability of successful decoding given in Eq. (14) and
Eq. (15), respectively.